40 research outputs found

    A random matrix analysis and improvement of semi-supervised learning for large dimensional data

    This article provides an original understanding of the behavior of a class of graph-oriented semi-supervised learning algorithms in the limit of large and numerous data. It is demonstrated that the intuition at the root of these methods collapses in this limit and that, as a result, most of them become inconsistent. Corrective measures and a new data-driven parametrization scheme are proposed along with a theoretical analysis of the asymptotic performances of the resulting approach. A surprisingly close behavior between theoretical performances on Gaussian mixture models and on real datasets is also illustrated throughout the article, thereby suggesting the importance of the proposed analysis for dealing with practical data. As a result, significant performance gains are observed on practical data classification using the proposed parametrization

    Consistent Semi-Supervised Graph Regularization for High Dimensional Data

    Semi-supervised Laplacian regularization, a standard graph-based approach for learning from both labelled and unlabelled data, was recently shown to have insignificant high-dimensional learning efficiency with respect to unlabelled data (Mai and Couillet 2018), causing it to be outperformed by its unsupervised counterpart, spectral clustering, given sufficient unlabelled data. Following a detailed discussion of the origin of this inconsistency problem, a novel regularization approach involving a centering operation is proposed as a solution, supported by both theoretical analysis and empirical results

    Lead exposure assessment among pregnant women, newborns, and children: case study from Karachi, Pakistan.

    Lead (Pb) in petrol has been banned in developed countries. Despite the control of Pb in petrol since 2001, high levels were reported in the blood of pregnant women and children in Pakistan. However, the identification of sources of Pb has been elusive due to its pervasiveness. In this study, we assessed the lead intake of pregnant women and one- to three-year-old children from food, water, house dust, respirable dust, and soil. In addition, we completed the fingerprinting of the Pb isotopic ratios (LIR) of petrol and secondary sources (food, house-dust, respirable dust, soil, surma (eye cosmetics)) of exposure within the blood of pregnant women, newborns, and children. Eight families, with high (~50 μg/dL), medium (~20 μg/dL), and low blood levels (~10 μg/dL), were selected from 60 families. The main sources of exposure to lead for children were food and house-dust, and those for pregnant women were soil, respirable dust, and food. LIR was determined by inductively coupled plasma quadrupole mass spectrometry (ICP-QMS) with a two sigma uncertainty of ±0.03%. The LIR of mothers and newborns was similar. In contrast, surma, and to a larger extent petrol, exhibited a negligible contribution to both the child’s and mother’s blood Pb. Household wet-mopping could be effective in reducing Pb exposure. This intake assessment could be replicated for other developing countries to identify sources of lead and the burden of lead exposure in the population

    miR-486-3p Influences the Neurotoxicity of α-Synuclein by Targeting the SIRT2 Gene and the Polymorphisms at Target Sites Contributing to Parkinson’s Disease

    Background/Aims: Increasing evidence suggests the important role of sirtuin 2 (SIRT2) in the pathology of Parkinson’s disease (PD). However, the association between potential functional polymorphisms in the SIRT2 gene and PD still needs to be identified. Exploring the molecular mechanism underlying this potential association could also provide novel insights into the pathogenesis of this disorder. Methods: Bioinformatics analysis and screening were first performed to find potential microRNAs (miRNAs) that could target the SIRT2 gene, and molecular biology experiments were carried out to further identify the regulation between miRNA and SIRT2 and characterize the pivotal role of miRNA in PD models. Moreover, a clinical case-control study was performed with 304 PD patients and 312 healthy controls from the Chinese Han population to identify the possible association of single nucleotide polymorphisms (SNPs) within the miRNA binding sites of SIRT2 with the risk of PD. Results: Here, we demonstrate that miR-486-3p binds to the 3’ UTR of SIRT2 and influences the translation of SIRT2. MiR-486-3p mimics can decrease the level of SIRT2 and reduce α-synuclein (α-syn)-induced aggregation and toxicity, which may contribute to the progression of PD. Interestingly, we find that a SNP, rs2241703, may disrupt miR-486-3p binding sites in the 3’ UTR of SIRT2, subsequently influencing the translation of SIRT2. Through the clinical case-control study, we further verify that rs2241703 is associated with PD risk in the Chinese Han population. Conclusion: The present study confirms that the rs2241703 polymorphism in the SIRT2 gene is associated with PD in the Chinese Han population, provides the potential mechanism of the susceptibility locus in determining PD risk and reveals a potential target of miRNA for the treatment and prevention of PD

    Méthodes des matrices aléatoires pour l’apprentissage en grandes dimensions

    The Big Data challenge induces a need for machine learning algorithms to evolve towards large dimensional and more efficient learning engines. Recently, a new direction of research has emerged that consists in analyzing learning methods in the modern regime where the number n and the dimension p of data samples are commensurately large. Compared to the conventional regime where n>>p, the regime with large and comparable n,p is particularly interesting as the learning performance in this regime remains sensitive to the tuning of hyperparameters, thus opening a path into the understanding and improvement of learning techniques for large dimensional datasets. The technical approach employed in this thesis draws on several advanced tools of high dimensional statistics, allowing us to conduct more elaborate analyses beyond the state of the art. The first part of this dissertation is devoted to the study of semi-supervised learning on high dimensional data. Motivated by our theoretical findings, we propose a superior alternative to the standard semi-supervised method of Laplacian regularization. The methods involving implicit optimizations, such as SVMs and logistic regression, are next investigated under realistic mixture models, providing exhaustive details on the learning mechanism. Several important consequences are thus revealed, some of which are even in contradiction with common belief

    Methods of random matrices for large dimensional statistical learning

    The Big Data challenge induces a need for machine learning algorithms to evolve towards large dimensional and more efficient learning engines. Recently, a new direction of research has emerged that consists in analyzing learning methods in the modern regime where the number n and the dimension p of data samples are commensurately large. Compared to the conventional regime where n>>p, the regime with large and comparable n,p is particularly interesting as the learning performance in this regime remains sensitive to the tuning of hyperparameters, thus opening a path into the understanding and improvement of learning techniques for large dimensional datasets. The technical approach employed in this thesis draws on several advanced tools of high dimensional statistics, allowing us to conduct more elaborate analyses beyond the state of the art. The first part of this dissertation is devoted to the study of semi-supervised learning on high dimensional data. Motivated by our theoretical findings, we propose a superior alternative to the standard semi-supervised method of Laplacian regularization. The methods involving implicit optimizations, such as SVMs and logistic regression, are next investigated under realistic mixture models, providing exhaustive details on the learning mechanism. Several important consequences are thus revealed, some of which are even in contradiction with common belief

    Semi-supervised Spectral Clustering

    In this article, we propose a semi-supervised version of spectral clustering, a widespread graph-based unsupervised learning method. Semi-supervised spectral clustering has the advantage of producing a consistent classification of data given a sufficiently large number of labelled or unlabelled data, unlike classical graph-based semi-supervised methods, which are only consistent on labelled data. Theoretical arguments are provided to support this novel approach, along with empirical evidence that confirms the theoretical claims and demonstrates its superiority over other graph-based semi-supervised methods